Lecture based on material from Tom Mitchell's lecture 2, Nina Balcan's lecture 2, and Kilian Weinberger's lecture 17
The perceptron algorithm converges in $\frac{1}{\gamma^2}$ updates if the data is linearly separable.
$\gamma$ is the margin of the problem instance (defined on next slide).
Given:
• the data is linearly separable with some margin $\gamma > 0$, and
• every example lies in the unit ball, $\|x\| \le 1$.
If all of the above holds, then the Perceptron algorithm makes at most $\frac{1}{\gamma^2}$ mistakes.
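The algorithm this bound applies to can be sketched as follows; the toy dataset and function name below are illustrative, not from the lecture:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron on data X (rows are examples) with labels y in {-1, +1}.

    Returns the learned weight vector and the number of updates (mistakes).
    If the data is linearly separable with margin gamma and ||x|| <= 1,
    the number of mistakes is at most 1 / gamma**2.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:   # mistake: update w
                w += label * x
                mistakes += 1
                clean_pass = False
        if clean_pass:                      # a full pass with no mistakes: done
            break
    return w, mistakes

# Hypothetical linearly separable data, scaled into the unit ball
X = np.array([[0.5, 0.4], [0.3, 0.9], [-0.6, -0.2], [-0.1, -0.8]])
y = np.array([1, 1, -1, -1])
w, m = perceptron(X, y)   # w separates the data; m is the mistake count
```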
Learn concept PlayTennis (i.e., decide whether our friend will play tennis on a given day)
Example: x=(Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=High), h(x)=Yes
Suppose $X = (X_1, X_2, \ldots, X_n)$ where the $X_i$ are boolean-valued variables
How would you represent $Y = X_2\land X_5?$ $~ ~ ~ ~ ~ $ $Y = X_2\lor X_5?$
How would you represent $X_2X_5\lor X_3X_4(\neg X_1)?$
| $$X_1$$ | $$X_2$$ | $$X_3$$ | $$Y$$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |
How to automatically find a good hypothesis for training data?
When do we generalize and do well on unseen data?
Other core questions in machine learning:
How to pick the smallest tree?
Luckily, we have very practical heuristics and top-down algorithms (e.g., ID3) that can achieve a good solution.
ID3: a natural greedy approach to growing a decision tree top-down, from the root to the leaves, by repeatedly replacing an existing leaf with an internal node.
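This greedy loop can be sketched as follows (a minimal illustration, not the lecture's exact pseudocode; the dict-based tree encoding is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_k p_k log2 p_k over the class frequencies in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, features):
    """Minimal ID3 sketch: grow a tree top-down, greedily choosing the feature
    whose split minimizes the weighted entropy of the children (equivalently,
    maximizes information gain). `examples` is a list of dicts
    mapping feature name -> value; the tree is {feature: {value: subtree}}."""
    if len(set(labels)) == 1:                 # pure node: make it a leaf
        return labels[0]
    if not features:                          # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(f):
        total = 0.0
        for v in set(e[f] for e in examples):
            subset = [l for e, l in zip(examples, labels) if e[f] == v]
            total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(features, key=split_entropy)   # greedy choice at this node
    rest = [f for f in features if f != best]
    tree = {best: {}}
    for v in set(e[best] for e in examples):
        sub_x = [e for e in examples if e[best] == v]
        sub_y = [l for e, l in zip(examples, labels) if e[best] == v]
        tree[best][v] = id3(sub_x, sub_y, rest)
    return tree

def predict(tree, x):
    """Follow the learned tree down to a leaf."""
    while isinstance(tree, dict):
        f = next(iter(tree))
        tree = tree[f][x[f]]
    return tree
```

On the truth table above (where $Y = X_1 \lor X_2$), this recovers a tree consistent with all eight rows.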
Algorithm main loop:
• Entropy measures the complexity of $S$:
$$H_S = - p_1 \log_2(p_1) - p_0 \log_2(p_0)$$
(in general, with $c$ classes, $H_S = -\sum_{k=1}^{c} p_k \log_2(p_k)$).
E.g., if there are $c$ classes, all equally likely, then $p_k = \frac{1}{c}$ for all $k$ and $H_S = -\log_2 \frac{1}{c} = \log_2 c$
We want to be as far as possible from all classes being equally likely (call this the uniform distribution, $q_k= \frac{1}{c}$ for all $k$). This can be done by finding a distribution $(p_1,\ldots,p_c)$ that maximizes the KL-divergence (Kullback–Leibler divergence) from $q$.
(The KL-divergence is not a metric because it is not symmetric, i.e., $\text{KL}(p\|q)\ne \text{KL}(q\|p)$.)
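To make these facts concrete (a small sketch; the distribution `p` below is an arbitrary illustrative choice):

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k log2 p_k (terms with p_k = 0 contribute 0)."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def kl(p, q):
    """KL(p || q) = sum_k p_k log2 (p_k / q_k). Not symmetric in p and q."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

c = 4
uniform = [1 / c] * c
# Uniform distribution attains the maximum entropy, log2 c
assert abs(entropy(uniform) - math.log2(c)) < 1e-12

# Against the uniform q, KL(p || q) = log2 c - H(p): maximizing the
# divergence from uniform is the same as minimizing the entropy of p
p = [0.7, 0.1, 0.1, 0.1]
assert abs(kl(p, uniform) - (math.log2(c) - entropy(p))) < 1e-9
assert kl(p, uniform) != kl(uniform, p)   # asymmetry: KL is not a metric
```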
Learn concept PlayTennis (i.e., decide whether our friend will play tennis on a given day)
Overfitting can occur because of noisy data, and because ID3 is not guaranteed to output a small hypothesis even if one exists.
Consider adding a noisy example:
The tree we learned would no longer be consistent with the training data
Task: learning which medical patients have a form of diabetes.
Grow full tree, then prune it
Example: Reduced-Error Pruning
Random forests: train many trees, each on a different random subset of the data and of the features
Will see ensemble learning later in the course